AITopics | phoneme recognition

phoneme recognition

Hessian-free Optimization for Learning Deep Multidimensional Recurrent Neural Networks

Minhyung Cho, Chandra Dhir, Jaehyung Lee

Neural Information Processing Systems, Oct-2-2025, 12:08:39 GMT

Multidimensional recurrent neural networks (MDRNNs) have shown remarkable performance in speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Because connectionist temporal classification (CTC) is used as the objective when training an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated, and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN with a depth of up to 15 layers is successfully trained using HF, resulting in improved performance for sequence labeling.

approximation, optimization, sequence, (16 more...)

Neural Information Processing Systems

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
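
As a rough illustration of the Hessian-free idea named in the abstract above (not the authors' implementation), the sketch below solves the Newton system H d = -g with conjugate gradient, using only Hessian-vector products approximated by finite differences of the gradient. The toy quadratic objective, the function names, and the step size `eps` are all assumptions made for this example; practical HF training of neural networks typically uses Gauss-Newton products and damping, which are omitted here.

```python
# Toy Hessian-free step: conjugate gradient (CG) on H d = -g using only
# Hessian-vector products. All names and the objective are illustrative.

def grad(w):
    # Gradient of f(w) = 0.5 * w^T A w - b^T w with A = [[3, 1], [1, 2]], b = [1, 1].
    return [3.0 * w[0] + 1.0 * w[1] - 1.0,
            1.0 * w[0] + 2.0 * w[1] - 1.0]

def hess_vec(w, v, eps=1e-6):
    # Finite-difference Hessian-vector product: (grad(w + eps*v) - grad(w)) / eps.
    g0 = grad(w)
    g1 = grad([wi + eps * vi for wi, vi in zip(w, v)])
    return [(a - b) / eps for a, b in zip(g1, g0)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hf_step(w, cg_iters=10, tol=1e-10):
    # Solve H d = -g by CG without ever forming H, then return w + d.
    g = grad(w)
    d = [0.0] * len(w)
    r = [-gi for gi in g]              # residual of H d = -g at d = 0
    p = list(r)
    rs = dot(r, r)
    for _ in range(cg_iters):
        if rs < tol:                   # squared residual is small enough
            break
        hp = hess_vec(w, p)
        alpha = rs / dot(p, hp)
        d = [di + alpha * pi for di, pi in zip(d, p)]
        r = [ri - alpha * hi for ri, hi in zip(r, hp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return [wi + di for wi, di in zip(w, d)]

w_new = hf_step([0.0, 0.0])
```

On this convex quadratic a single step lands near the minimizer, which solves A w = b. The abstract's point is that CTC is not convex, so the curvature matrix used in place of H has to come from a convex approximation for CG to behave this well.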

TSPC: A Two-Stage Phoneme-Centric Architecture for code-switching Vietnamese-English Speech Recognition

Nguyen, Minh N. H., Tran, Anh Nguyen, Dinh, Dung Truong, Van Vo, Nam

arXiv.org Artificial Intelligence, Sep-23-2025

Code-switching (CS) presents a significant challenge for general Automatic Speech Recognition (ASR) systems. Existing methods often fail to capture the subtle phonological shifts inherent in CS scenarios. The challenge is particularly difficult for language pairs like Vietnamese and English, where both distinct phonological features and the ambiguity arising from similar sound recognition are present. In this paper, we propose a novel architecture for Vietnamese-English CS ASR, a Two-Stage Phoneme-Centric model (TSPC). The TSPC employs a phoneme-centric approach, built upon an extended Vietnamese phoneme set as an intermediate representation to facilitate mixed-lingual modeling. Experimental results demonstrate that TSPC consistently outperforms existing baselines, including PhoWhisper-base, in Vietnamese-English CS ASR, achieving a significantly lower word error rate of 19.9% with reduced training resources. Furthermore, the phonetic-based two-stage architecture enables phoneme adaptation and language conversion to enhance ASR performance in complex CS Vietnamese-English ASR scenarios.

machine learning, natural language, recognition, (18 more...)

arXiv.org Artificial Intelligence

2509.05983

Country: Asia > Vietnam (0.29)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
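
The word error rate quoted in the abstract above is conventionally the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below shows that standard definition; the function names and the code-switched example sentence are invented for illustration, and the paper's exact text normalization is not stated in the abstract.

```python
def edit_distance(ref, hyp):
    # Levenshtein distance over token lists, keeping one rolling row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]

def word_error_rate(reference, hypothesis):
    # WER = (substitutions + deletions + insertions) / reference length.
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

# Invented example: one substituted word and one dropped word.
wer = word_error_rate("tôi muốn book một meeting room",
                      "tôi muốn búc một meeting")
```

Here two of the six reference words are wrong, so `wer` is 2/6. Because insertions count as errors, WER can exceed 1.0 for very long hypotheses.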

Graph Connectionist Temporal Classification for Phoneme Recognition

Grafé, Henry, Van hamme, Hugo

arXiv.org Artificial Intelligence, Sep-9-2025

Automatic Phoneme Recognition (APR) systems are often trained using pseudo phoneme-level annotations generated from text through Grapheme-to-Phoneme (G2P) systems. These G2P systems frequently output multiple possible pronunciations per word, but the standard Connectionist Temporal Classification (CTC) loss cannot account for such ambiguity during training. In this work, we adapt Graph Temporal Classification (GTC) to the APR setting. GTC enables training from a graph of alternative phoneme sequences, allowing the model to consider multiple pronunciations per word as valid supervision. Our experiments on English and Dutch data sets show that incorporating multiple pronunciations per word into the training loss consistently improves phoneme error rates compared to a baseline trained with CTC. These results suggest that integrating pronunciation variation into the loss function is a promising strategy for training APR systems from noisy G2P-based supervision.

artificial intelligence, machine learning, pronunciation, (17 more...)

arXiv.org Artificial Intelligence

2509.05399

Country:

Europe (1.00)
Asia (1.00)
North America > United States > California (0.28)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
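
To make the contrast in the abstract above concrete, the sketch below computes the standard CTC likelihood with the forward algorithm and then sums it over several alternative phoneme sequences. This is a deliberate simplification, not the authors' GTC: it enumerates whole-utterance alternatives instead of building a graph, works in the probability domain (fine for toy inputs, numerically unsafe for long utterances), and all function names are assumptions.

```python
import math

def ctc_likelihood(probs, labels, blank=0):
    # probs[t][k]: probability of symbol k at frame t.
    # labels: target symbol ids without blanks.
    ext = [blank]
    for label in labels:
        ext += [label, blank]          # interleave blanks: _ a _ b _
    size = len(ext)
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)

def multi_pronunciation_loss(probs, alternatives):
    # Distinct label sequences cover disjoint sets of alignments, so their
    # likelihoods can be summed; duplicates are removed first.
    unique = {tuple(alt) for alt in alternatives}
    return -math.log(sum(ctc_likelihood(probs, list(alt)) for alt in unique))

# Two frames, three symbols (0 = blank), uniform posteriors.
uniform = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]
single = ctc_likelihood(uniform, [1])
loss = multi_pronunciation_loss(uniform, [[1], [2]])
```

With plain CTC only one of the alternatives could serve as the target. Summing over alternatives lets the model put probability mass on whichever pronunciation matches the audio, which is the behavior the abstract attributes to training on a graph of pronunciations.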

Whisper based Cross-Lingual Phoneme Recognition between Vietnamese and English

Minh, Nguyen Huu Nhat, Anh, Tran Nguyen, Dung, Truong Dinh, Van Nam, Vo, Tuyen, Le Pham

arXiv.org Artificial Intelligence, Aug-28-2025

Cross-lingual phoneme recognition has emerged as a significant challenge for accurate automatic speech recognition (ASR) when mixing Vietnamese and English pronunciations. Unlike many languages, Vietnamese relies on tonal variations to distinguish word meanings, whereas English features stress patterns and non-standard pronunciations that hinder phoneme alignment between the two languages. To address this challenge, we propose a novel bilingual speech recognition approach with two primary contributions: (1) constructing a representative bilingual phoneme set that bridges the differences between Vietnamese and English phonetic systems; (2) designing an end-to-end system that leverages the PhoWhisper pre-trained encoder for deep high-level representations to improve phoneme recognition. Our extensive experiments demonstrate that the proposed approach not only improves recognition accuracy in bilingual speech recognition for Vietnamese but also provides a robust framework for addressing the complexities of tonal and stress-based phoneme recognition.

machine learning, natural language, recognition, (19 more...)

arXiv.org Artificial Intelligence

2508.19270

Country: Asia > Vietnam (0.28)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)

CUPE: Contextless Universal Phoneme Encoder for Language-Agnostic Speech Processing

Rehman, Abdul, Zhang, Jian-Jun, Yang, Xiaosong

arXiv.org Artificial Intelligence, Aug-22-2025

Universal phoneme recognition typically requires analyzing long speech segments and language-specific patterns. Many speech processing tasks require pure phoneme representations free from contextual influence, which motivated our development of CUPE - a lightweight model that captures key phoneme features in just 120 milliseconds, about one phoneme's length. CUPE processes short, fixed-width windows independently and, despite fewer parameters than current approaches, achieves competitive cross-lingual performance by learning fundamental acoustic patterns common to all languages. Our extensive evaluation through supervised and self-supervised training on diverse languages, including zero-shot tests on the UCLA Phonetic Corpus, demonstrates strong cross-lingual generalization and reveals that effective universal speech processing is possible through modeling basic acoustic patterns within phoneme-length windows.

machine learning, natural language, recognition, (16 more...)

arXiv.org Artificial Intelligence

2508.15316

Country: Europe > United Kingdom (0.14)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
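
The abstract above describes processing short, fixed-width windows of about 120 milliseconds independently. A minimal sketch of that framing step follows; the 16 kHz sample rate, the non-overlapping hop, and the function name are assumptions, since the abstract does not state them.

```python
def fixed_windows(samples, sample_rate=16000, window_ms=120, hop_ms=120):
    # Split a 1-D sample sequence into fixed-width windows.
    # Trailing samples that do not fill a whole window are dropped.
    win = sample_rate * window_ms // 1000
    hop = sample_rate * hop_ms // 1000
    return [samples[i:i + win]
            for i in range(0, len(samples) - win + 1, hop)]

one_second = [0.0] * 16000            # placeholder audio, one second long
windows = fixed_windows(one_second)
```

At 16 kHz a 120 ms window holds 1920 samples, so one second of audio yields eight full windows and 640 leftover samples. Each window would then be encoded on its own, with no information shared between neighbors.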

Towards disentangling the contributions of articulation and acoustics in multimodal phoneme recognition

Foley, Sean, Nguyen, Hong, Lee, Jihwan, Kadiri, Sudarsana Reddy, Byrd, Dani, Goldstein, Louis, Narayanan, Shrikanth

arXiv.org Artificial Intelligence, Jun-2-2025

Although many previous studies have carried out multimodal learning with real-time MRI data that captures the audio-visual kinematics of the vocal tract during speech, these studies have been limited by their reliance on multi-speaker corpora. This prevents such models from learning a detailed relationship between acoustics and articulation due to considerable cross-speaker variability. In this study, we develop unimodal audio and video models as well as multimodal models for phoneme recognition using a long-form single-speaker MRI corpus, with the goal of disentangling and interpreting the contributions of each modality. Audio and multimodal models show similar performance on different phonetic manner classes but diverge on places of articulation. Interpretation of the models' latent space shows similar encoding of the phonetic space across audio and multimodal models, while the models' attention weights highlight differences in acoustic and articulatory timing for certain phonemes.

artificial intelligence, machine learning, recognition, (16 more...)

arXiv.org Artificial Intelligence

2505.24059

Country: North America > United States > California (0.29)

Genre: Research Report > New Finding (0.67)

Industry: Health & Medicine (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)

Self-Supervised Models for Phoneme Recognition: Applications in Children's Speech for Reading Learning

Medin, Lucas Block, Pellegrini, Thomas, Gelin, Lucile

arXiv.org Artificial Intelligence, Mar-6-2025

Child speech recognition is still an underdeveloped area of research due to the lack of data (especially for non-English languages) and the specific difficulties of this task. Having explored various architectures for child speech recognition in previous work, in this article we tackle recent self-supervised models. We first compare wav2vec 2.0, HuBERT and WavLM models adapted to phoneme recognition in French child speech, and continue our experiments with the best of them, WavLM base+. We then further adapt it by unfreezing its transformer blocks during fine-tuning on child speech, which greatly improves its performance and makes it significantly outperform our base model, a Transformer+CTC. Finally, we study in detail the behaviour of these two models under the real conditions of our application, and show that WavLM base+ is more robust to various reading tasks and noise levels.

Index Terms: speech recognition, child speech, self-supervised learning

artificial intelligence, machine learning, speech, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.21437/Interspeech.2024-1095

2503.04710

Country:

Europe > France > Occitanie > Haute-Garonne > Toulouse (0.05)
North America > United States (0.04)
Europe > France > Île-de-France > Paris > Paris (0.04)
(2 more...)

Genre: Research Report (1.00)

Industry: Education (0.93)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Biomimetic Frontend for Differentiable Audio Processing

Famularo, Ruolan Leslie, Zotkin, Dmitry N., Shamma, Shihab A., Duraiswami, Ramani

arXiv.org Artificial Intelligence, Sep-13-2024

While models in audio and speech processing are becoming deeper and more end-to-end, they as a consequence need expensive training on large data, and are often brittle. We build on a classical model of human hearing and make it differentiable, so that we can combine traditional explainable biomimetic signal processing approaches with deep-learning frameworks. This allows us to arrive at an expressive and explainable model that is easily trained on modest amounts of data. We apply this model to audio processing tasks, including classification and enhancement. Results show that our differentiable model surpasses black-box approaches in terms of computational efficiency and robustness, even with little training data. We also discuss other potential applications.

frontend, modulation, recognition, (15 more...)

arXiv.org Artificial Intelligence

2409.08997

Country:

North America > United States > New York (0.04)
North America > United States > Maryland > Prince George's County > College Park (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (0.34)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Speaker- and Text-Independent Estimation of Articulatory Movements and Phoneme Alignments from Speech

Weise, Tobias, Klumpp, Philipp, Demir, Kubilay Can, Pérez-Toro, Paula Andrea, Schuster, Maria, Noeth, Elmar, Heismann, Bjoern, Maier, Andreas, Yang, Seung Hee

arXiv.org Artificial Intelligence, Jul-3-2024

This paper introduces a novel combination of two tasks, previously treated separately: acoustic-to-articulatory speech inversion (AAI) and phoneme-to-articulatory (PTA) motion estimation. We refer to this joint task as acoustic phoneme-to-articulatory speech inversion (APTAI) and explore two different approaches, both working speaker- and text-independently during inference. We use a multi-task learning setup, with the end-to-end goal of taking raw speech as input and estimating the corresponding articulatory movements, phoneme sequence, and phoneme alignment. While both proposed approaches share these same requirements, they differ in their way of achieving phoneme-related predictions: one is based on frame classification, the other on a two-stage training procedure and forced alignment. We reach competitive performance of 0.73 mean correlation for the AAI task and achieve up to approximately 87% frame overlap compared to a state-of-the-art text-dependent phoneme forced aligner.

alignment, phoneme sequence, speech inversion, (10 more...)

arXiv.org Artificial Intelligence

2407.03132

Country:

Europe > Germany > Bavaria > Middle Franconia > Nuremberg (0.04)
South America > Colombia > Antioquia Department > Medellín (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Speech (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
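
The two figures reported in the abstract above, mean correlation and frame overlap, can be read as a Pearson correlation averaged over articulatory channels and the fraction of frames whose phoneme label agrees with a reference alignment. The sketch below follows that common reading; the paper's exact definitions are not given in the abstract, and the function names and toy data are assumptions.

```python
import math

def pearson(x, y):
    # Pearson correlation coefficient of two equal-length sequences.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def mean_correlation(pred_channels, true_channels):
    # Average the per-channel correlation over all articulatory channels.
    scores = [pearson(p, t) for p, t in zip(pred_channels, true_channels)]
    return sum(scores) / len(scores)

def frame_overlap(pred_frames, ref_frames):
    # Fraction of frames where predicted and reference labels agree.
    if len(pred_frames) != len(ref_frames):
        raise ValueError("frame sequences must have the same length")
    matches = sum(p == r for p, r in zip(pred_frames, ref_frames))
    return matches / len(ref_frames)

# Toy data: one perfectly tracked channel and one inverted channel.
corr = mean_correlation([[1, 2, 3], [1, 2, 3]], [[2, 4, 6], [3, 2, 1]])
overlap = frame_overlap(list("aabbbcc"), list("aaabbcc"))
```

In the toy data the two channels correlate at +1 and -1, so their mean is 0, and six of the seven frames agree. Note that `pearson` divides by zero for a constant channel, which a fuller implementation would need to guard against.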

TIPAA-SSL: Text Independent Phone-to-Audio Alignment based on Self-Supervised Learning and Knowledge Transfer

Tits, Noé, Bhatnagar, Prernna, Dutoit, Thierry

arXiv.org Artificial Intelligence, May-3-2024

In this paper, we present a novel approach for text-independent phone-to-audio alignment based on phoneme recognition, representation learning and knowledge transfer. Our method leverages a self-supervised model (wav2vec2) fine-tuned for phoneme recognition using a Connectionist Temporal Classification (CTC) loss, a dimension reduction model and a frame-level phoneme classifier trained on forced-alignment labels (obtained with the Montreal Forced Aligner) to produce multi-lingual phonetic representations, thus requiring minimal additional training. We evaluate our model using synthetic native data from the TIMIT dataset and the SCRIBE dataset for American and British English, respectively. Our proposed model outperforms the state-of-the-art (charsiu) in statistical metrics and has applications in language learning and speech processing systems. We leave experiments on other languages for future work, but the design of the system makes it easily adaptable to other languages.

dataset, phoneme recognition, representation, (12 more...)

arXiv.org Artificial Intelligence

2405.02124

Country:

North America > Canada > Quebec > Montreal (0.24)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > Promising Solution (0.66)

Industry: Education (0.50)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
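
The abstract above combines a CTC-trained phoneme recognizer with frame-level classification to obtain alignments. The sketch below illustrates only one generic ingredient, namely how frame-wise CTC argmax outputs can be collapsed into phone segments with rough time spans. It is not the authors' pipeline, and the 20 ms frame duration, the symbol ids, and the function name are assumptions.

```python
def ctc_greedy_segments(frame_ids, blank=0, frame_ms=20):
    # frame_ids: most likely symbol id for each frame.
    # Returns (symbol, start_ms, end_ms) for each decoded phone.
    segments = []
    prev = blank
    for t, k in enumerate(frame_ids):
        if k != blank and k != prev:
            segments.append([k, t, t + 1])   # a new phone starts at frame t
        elif k != blank:
            segments[-1][2] = t + 1          # the same phone continues
        prev = k
    return [(k, s * frame_ms, e * frame_ms) for k, s, e in segments]

# Frames: blank, 3, 3, blank, 3, 5, 5, blank
segments = ctc_greedy_segments([0, 3, 3, 0, 3, 5, 5, 0])
```

Repeated ids merge into one segment, and a blank between two identical ids keeps them as separate phones. CTC outputs tend to be peaky, so spans recovered this way usually underestimate phone durations, which is one reason a separate frame-level classifier can help alignment.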
